Predicting Bank Marketing Success on Term Deposit Subscription¶
Summary¶
In this analysis, we build a predictive model aimed at determining whether a client will subscribe to a term deposit, using data from direct marketing campaigns (phone calls) run by a Portuguese banking institution.
After exploring several models (logistic regression, KNN, decision tree, naive Bayes), we selected logistic regression as our primary predictive tool. The final model performs well on an unseen test set, achieving the highest AUC (Area Under the Curve) among the candidates at 0.899. This strong AUC score indicates that the model can effectively differentiate between positive and negative outcomes. Notably, factors such as last contact duration, last contact month of the year, and the client's type of job play a significant role in the classification decision.
Introduction¶
In the banking sector, the evolution of specialized bank marketing has been driven by the expansion and intensification of the financial sector, introducing competition and transparency. Recognizing the need for professional and efficient marketing strategies to engage an increasingly informed and critical customer base, banks grapple with conveying the complexity and abstract nature of financial services. Precision in reaching specific locations, demographics, and societies has proven challenging. The advent of machine learning has revolutionized this landscape, utilizing data and analytics to inform banks about customers more likely to subscribe to financial products. In this machine learning-driven bank marketing project, we explore how a particular Portuguese bank can leverage predictive analytics to strategically prioritize customers for subscribing to a bank term deposit, showcasing the transformative potential of machine learning in refining marketing strategies and optimizing customer targeting for financial institutions.
Data¶
Our analysis centers on direct marketing campaigns conducted by a prominent Portuguese banking institution, specifically phone call campaigns designed to predict clients' likelihood of subscribing to a bank term deposit. The dataset provides a detailed view of these marketing initiatives, offering valuable insights into the factors influencing client subscription decisions. The dataset, named 'bank-full.csv,' contains 45,211 examples with 17 columns (16 input features plus the target), ordered by date. The primary focus of our analysis is classification: predicting whether a client will subscribe ('yes') or not ('no') to a term deposit, providing crucial insight into client behavior in response to direct marketing initiatives. Through rigorous exploration of this dataset, we aim to uncover patterns and trends that can inform and enhance the effectiveness of future marketing campaigns.
Methods¶
In the present analysis, we compare the results obtained with four well-known machine learning techniques: Logistic Regression (LR), Naïve Bayes (NB), Decision Trees (DT), and K-Nearest Neighbors (KNN). Among these algorithms, Logistic Regression yielded the best performance in terms of accuracy and F-measure. Logistic Regression was chosen for its proficiency in modeling associations between a binary dependent variable and continuous explanatory variables. Given the dataset's characteristics (continuous independent variables and a binary dependent variable), Logistic Regression emerges as a suitable classifier for predicting customer subscription in the bank's telemarketing campaign for term deposits. The classification report reveals trade-offs between precision and recall: while achieving an overall accuracy of 83%, the Logistic Regression model demonstrates strength in identifying positive cases, providing a foundation for optimizing future marketing strategies.
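The comparison workflow described above can be sketched as follows. This is a minimal illustration on a small synthetic dataset standing in for the bank data; the cross-validation setup and synthetic data parameters are assumptions for demonstration, not the exact procedure used to obtain the figures reported here.

```python
# Sketch: compare the four candidate classifiers by cross-validated ROC AUC
# on a synthetic binary-classification dataset (stand-in for the bank data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=522)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=522),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean 5-fold cross-validated AUC per model
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
for name, auc in scores.items():
    print(f"{name:<4} mean CV AUC: {auc:.3f}")
```

Scalers are wrapped into pipelines for LR and KNN so that scaling is refit inside each fold, avoiding leakage across the cross-validation splits.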
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix,f1_score, roc_auc_score, classification_report, recall_score, precision_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE, KMeansSMOTE
from imblearn.under_sampling import ClusterCentroids, RandomUnderSampler
import warnings
url = 'https://archive.ics.uci.edu/static/public/222/data.csv'
request = requests.get(url)
with open("../data/raw/bank-full.csv", 'wb') as f:
    f.write(request.content)
Global Config¶
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.3f}'.format
RANDOM_STATE = 522
warnings.filterwarnings("ignore")
Pre-Exploration¶
bank = pd.read_csv('../data/raw/bank-full.csv', sep=',')
bank.columns
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day_of_week', 'month', 'duration', 'campaign',
'pdays', 'previous', 'poutcome', 'y'],
dtype='object')
bank.shape
(45211, 17)
bank.head()
|   | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN | no |
| 3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN | no |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN | no |
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64 
 12  campaign     45211 non-null  int64 
 13  pdays        45211 non-null  int64 
 14  previous     45211 non-null  int64 
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
bank.y.value_counts()/len(bank)
y
no     0.883
yes    0.117
Name: count, dtype: float64
Note that the target is class-imbalanced: roughly 88% of clients did not subscribe.
Train Test Split¶
bank_train, bank_test = train_test_split(
    bank,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=bank.y,
)
bank_train.y.value_counts()/len(bank_train)
y
no     0.883
yes    0.117
Name: count, dtype: float64
X_train, y_train = bank_train.drop(columns=["y"]), bank_train["y"]
X_test, y_test = bank_test.drop(columns=["y"]), bank_test["y"]
Via the stratified split, the label distribution of the original dataset is preserved in both the training and test sets.
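The effect of stratification can be sketched on a small synthetic label vector with the same ~88/12 imbalance; the numbers here are illustrative, not drawn from the bank data.

```python
# Sketch: a stratified split preserves the minority-class rate in the test set.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array(["no"] * 88 + ["yes"] * 12)   # mimic the ~88/12 imbalance
X = np.arange(len(y)).reshape(-1, 1)       # dummy feature column

_, _, _, y_te = train_test_split(
    X, y, test_size=0.25, random_state=522, stratify=y
)
print((y_te == "yes").mean())              # matches the original 12% rate: 0.12
```

Without `stratify`, a small test split of an imbalanced dataset can end up with noticeably more or fewer minority examples than the population rate.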
EDA¶
for i in list(bank_train.columns):
print(f"{i:<10}-> {bank_train[i].nunique():<5} unique values")
age       -> 77    unique values
job       -> 11    unique values
marital   -> 3     unique values
education -> 3     unique values
default   -> 2     unique values
balance   -> 6601  unique values
housing   -> 2     unique values
loan      -> 2     unique values
contact   -> 2     unique values
day_of_week-> 31    unique values
month     -> 12    unique values
duration  -> 1506  unique values
campaign  -> 47    unique values
pdays     -> 536   unique values
previous  -> 40    unique values
poutcome  -> 3     unique values
y         -> 2     unique values
bank_int = list(bank_train.select_dtypes(include = ['int64']).columns)
bank_str = list(bank_train.select_dtypes(include = ['object']).columns)
bank_categorical = bank_str + ['day_of_week']
bank_categorical
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y', 'day_of_week']
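With the numeric and categorical columns separated, a typical next step is a preprocessing pipeline that scales the numeric features and one-hot encodes the categorical ones. The sketch below uses a tiny hand-made frame with a few of the dataset's column names for illustration; it is not the exact preprocessor used later in the analysis.

```python
# Sketch: ColumnTransformer for mixed column types, as in this dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

demo = pd.DataFrame({
    "age": [58, 44, 33],
    "balance": [2143, 29, 2],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "no"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["marital", "housing"]),
])

# 2 scaled numeric columns + 2 + 2 one-hot columns = 6 output columns
transformed = preprocessor.fit_transform(demo)
print(transformed.shape)  # (3, 6)
```

`handle_unknown="ignore"` keeps the transformer from failing on category levels that appear only in the test split.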
Data Visualization¶
We plotted the distribution of each categorical predictor from the training data set, grouped and coloured by class (yes: green; no: blue).
import altair as alt
alt.data_transformers.disable_max_rows()
charts = []
for var in bank_categorical[:9]:  # plot predictors only; skip the target 'y' and the day column
num_rows = len(bank_train[var].unique())
chart = alt.Chart(bank_train).mark_bar(stroke=None).encode(
x=alt.X('count()', title='Count'),
y=alt.Y('y:N', title=None),
color=alt.Color('y:N', scale=alt.Scale(range=['#3C6682', '#45A778'])),
row=alt.Row(f'{var}:N')
).properties(
width=300,
height=300 / num_rows,
title=f'Grouped Bar Plot for {var}',
spacing=0
)
charts.append(chart)
final_chart = alt.concat(*charts, columns=3).configure_axis(grid=False).configure_header(
labelAngle=0,
labelAlign='left'
)
final_chart